[SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization #29283

HyukjinKwon · 2020-07-29T08:37:04Z

What changes were proposed in this pull request?

This PR proposes to:

Fix the error message when the output schema is mismatched with the R DataFrame returned from the given function. For example,

df <- createDataFrame(list(list(a=1L, b="2")))
count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))

Before:

Error in handleErrors(returnStatus, conn) :
  ... 
  java.lang.UnsupportedOperationException
    ...

After:

Error in handleErrors(returnStatus, conn) :
 ...
 java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType
    ...

Update documentation about the schema matching for gapply and dapply.

Why are the changes needed?

To show which schema is not matched, and let users know what's going on.

Does this PR introduce any user-facing change?

Yes, the error message is updated as above, and the documentation is updated.

How was this patch tested?

Manually tested, and unit tests were added.

HyukjinKwon · 2020-07-29T08:40:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala

+        val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
+        assert(outputTypes == actualDataTypes, "Invalid schema from gapply(): " +
+          s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
+        batch.rowIterator().asScala


This is the same as dapply:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/objects.scala

Lines 247 to 251 in 17586f9

    columnarBatchIter.flatMap { batch =>
      val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
      assert(outputTypes == actualDataTypes, "Invalid schema from dapply(): " +
        s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
      batch.rowIterator.asScala
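
To make the shared check easier to try outside Spark, here is a minimal sketch of the same comparison and message format. It is not the code in objects.scala: `ColType`, the two case objects, and `SchemaCheck` are hypothetical stand-ins introduced only for illustration, and the sketch returns an `Either` instead of calling `assert`.

```scala
// Hypothetical stand-ins for Spark's DataType objects (assumption: not Spark classes).
sealed trait ColType
case object IntegerType extends ColType
case object StringType extends ColType

object SchemaCheck {
  // Builds the "expected ..., got ..." text in the same shape as the assertion message.
  def mismatchMessage(api: String, expected: Seq[ColType], actual: Seq[ColType]): String =
    s"Invalid schema from $api(): " +
      s"expected ${expected.mkString(", ")}, got ${actual.mkString(", ")}"

  // Right(()) when the column types match position by position, Left(message) otherwise.
  def check(api: String, expected: Seq[ColType], actual: Seq[ColType]): Either[String, Unit] =
    if (expected == actual) Right(()) else Left(mismatchMessage(api, expected, actual))
}
```

Calling `SchemaCheck.check("gapply", Seq(IntegerType, IntegerType), Seq(IntegerType, StringType))` yields a `Left` whose message names both the declared and the actual column types, which is the information the old `UnsupportedOperationException` did not carry.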

HyukjinKwon · 2020-07-29T08:45:12Z

@viirya can you take a quick look when you're available?

SparkQA · 2020-07-29T12:55:25Z

Test build #126766 has finished for PR 29283 at commit 8ed454a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-29T13:05:06Z

Test build #126767 has finished for PR 29283 at commit ab5ecde.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

docs/sparkr.md

viirya

It looks nice, the improved error message. Just one minor comment about the doc.

viirya · 2020-07-30T02:37:40Z

LGTM

HyukjinKwon · 2020-07-30T06:15:47Z

Merged to master and branch-3.0. Thanks @viirya.

…pply with Arrow vectorization

### What changes were proposed in this pull request?

This PR proposes to:

1. Fix the error message when the output schema is mismatched with R DataFrame from the given function. For example,

```R
df <- createDataFrame(list(list(a=1L, b="2")))
count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))
```

**Before:**

```
Error in handleErrors(returnStatus, conn) :
  ...
  java.lang.UnsupportedOperationException
  ...
```

**After:**

```
Error in handleErrors(returnStatus, conn) :
  ...
  java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType
  ...
```

2. Update documentation about the schema matching for `gapply` and `dapply`.

### Why are the changes needed?

To show which schema is not matched, and let users know what's going on.

### Does this PR introduce _any_ user-facing change?

Yes, error message is updated as above, and documentation is updated.

### How was this patch tested?

Manually tested and unit tests were added.

Closes #29283 from HyukjinKwon/r-vectorized-error.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

SparkQA · 2020-07-30T06:46:22Z

Test build #126798 has finished for PR 29283 at commit 33d0cec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

probot-autolabeler bot added DOCS R SQL labels Jul 29, 2020

Error message to show the schema mismatch in gapply with Arrow vector…

ab5ecde

…ization

HyukjinKwon force-pushed the r-vectorized-error branch from 8ed454a to ab5ecde Compare July 29, 2020 08:39

viirya reviewed Jul 30, 2020

viirya approved these changes Jul 30, 2020

Tweak the words

33d0cec

HyukjinKwon closed this in e1d7321 Jul 30, 2020

HyukjinKwon deleted the r-vectorized-error branch December 7, 2020 02:06
